[Content Understanding] Add Copilot skills for custom-analyzer authoring#47218
chienyuanchang wants to merge 22 commits
Pull request overview
Adds two new GitHub Copilot skills under .github/skills/ of the azure-ai-contentunderstanding package that walk users through authoring custom analyzers end-to-end (single-doc-type and classify-and-route variants), plus a shared pure-Python schema validator and 19 unit tests. No public SDK API changes; assets live in .github/ and are excluded from the sdist.
Changes:
- New skills `cu-sdk-generate-analyzer` and `cu-sdk-generate-analyzer-classify-route` with helper scripts (`extract_layout.py`, `create_and_test.py`, `create_and_test_router.py`), shell wrappers, and JSON templates.
- New shared pure-Python validator (`_shared/schema_validator.py`) that catches `baseAnalyzerId` typos and structural errors before any service call.
- Updates to the `cu-sdk-common-knowledge` / `cu-sdk-sample-run` SKILL.md files and the package README, plus a small `testpreparer.py` enhancement (`create_client_from_credential` endpoint trailing-slash normalization) and 3 new test modules.
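
To make the "validate before any service call" idea concrete, here is a minimal stdlib-only sketch of an allow-list check. The function name `validate_schema`, the contents of the allow-list, and the `fieldSchema.fields` shape are assumptions for illustration; the actual `_shared/schema_validator.py` in this PR may be structured differently.

```python
# Hypothetical allow-list of base analyzer IDs (an assumption, not copied
# from the PR). Anything outside this set is reported before a client exists.
ALLOWED_BASE_ANALYZER_IDS = {
    "prebuilt-document",
    "prebuilt-audio",
    "prebuilt-video",
    "prebuilt-image",
}


def validate_schema(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the schema passed."""
    errors = []

    # Allow-list check: catches typos such as "prebuilt-documentAnalyzer".
    base_id = schema.get("baseAnalyzerId")
    if base_id not in ALLOWED_BASE_ANALYZER_IDS:
        errors.append(f"baseAnalyzerId {base_id!r} is not in the allow-list")

    # Structural check: assumed shape is {"fieldSchema": {"fields": {...}}}.
    field_schema = schema.get("fieldSchema")
    fields = field_schema.get("fields") if isinstance(field_schema, dict) else None
    if not isinstance(fields, dict) or not fields:
        errors.append("fieldSchema.fields must be a non-empty object")

    return errors
```

Returning a list of messages rather than raising on the first problem lets a calling script print every issue in one pass before deciding whether to contact the service.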
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/testpreparer.py | Add create_client_from_credential override that strips trailing / from endpoints. |
| tests/test_skills_shared_schema_validator.py | Unit tests for the shared validator (purity, accept/reject cases). |
| tests/test_skills_create_and_test.py | Unit tests for single-analyzer script: --help, validator-first behavior, leaf-row summarize. |
| tests/test_skills_classify_route_router.py | Unit tests for router script: per-category denominator, alias wiring, prebuilt passthrough. |
| README.md | Add rows for the two new skills in the Available Skills table. |
| .gitignore | Update comment on .local_only/ (does not actually add new ignores, contrary to PR description). |
| .github/skills/cu-sdk-setup/SKILL.md | Note that step numbering is referenced by the new skills. |
| .github/skills/cu-sdk-sample-run/SKILL.md | Add "next step" hints linking to the two new skills. |
| .github/skills/cu-sdk-common-knowledge/SKILL.md | Add two-stage pipeline rule, baseAnalyzerId table, classify-and-route rules. |
| .github/skills/_shared/README.md | Document the _shared/ library directory rules. |
| .github/skills/_shared/schema_validator.py | Pure-stdlib validator for analyzer schemas. |
| .github/skills/cu-sdk-generate-analyzer/SKILL.md | New single-doc-type analyzer authoring skill. |
| .github/skills/cu-sdk-generate-analyzer/scripts/README.md | Quick reference for the two helper scripts. |
| .github/skills/cu-sdk-generate-analyzer/scripts/extract_layout.{py,sh} | Stage 1 layout extraction helper. |
| .github/skills/cu-sdk-generate-analyzer/scripts/create_and_test.{py,sh} | Stage 2 validate→create→batch-test→summarize helper. |
| .github/skills/cu-sdk-generate-analyzer/templates/schema_template.json | Starter single-type schema template. |
| .github/skills/cu-sdk-generate-analyzer-classify-route/SKILL.md | New classify-and-route authoring skill (contains a duplicated [ASK USER] block). |
| .github/skills/cu-sdk-generate-analyzer-classify-route/scripts/create_and_test_router.{py,sh} | Router script: validate, create inner+outer analyzers, batch test, category-aware summary. |
| .github/skills/cu-sdk-generate-analyzer-classify-route/templates/classifier_template.json | Starter outer-classifier schema template. |

@@ -695,6 +697,8 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
[cu_sdk_setup_skill]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/cu-sdk-setup
[cu_sdk_sample_run_skill]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/cu-sdk-sample-run
Please add a "What's New" or "Changelog" section with a link to the CHANGELOG. This helps users track new additions easily.
We should also update the CHANGELOG for these new skills.
Will we release this as a new version, or just add it to the CHANGELOG and wait for the next release?
When using `config.contentCategories` to classify and route mixed-document
packets:

1. **Category descriptions follow the same text-anchored rule** as field
The category descriptions should be generic, and should not force the classifier to rely on hardcoded values unless those values are important information.
Updated the example JSON in classify-route Step 3 to use generic, content-kind descriptions and added a callout block right after the example warning the agent not to copy verbatim.
…r-sklls # Conflicts: # sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/cu-sdk-common-knowledge/SKILL.md # sdk/contentunderstanding/azure-ai-contentunderstanding/CHANGELOG.md
Sphinx and link-check both flag the relative "samples/sample_create_classifier.py" link in the What's New section. Use the reference-style absolute URL pattern matching the other sample links in this README.
> Repeat until all key fields reach **fill rate ≥ 80%** and
> **avg confidence ≥ 0.85**, or the user is satisfied.
>
> Stop and report to the user when any of:
I tried this and found that the skill should also report the following when it stops, so that the user knows what to do next:
- The final analyzer ID
- The file path to the final iteration of the schema file
- A pointer to the SDK samples for creating and using custom analyzers
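
One way to picture the stopping rule quoted above together with the report items requested in this comment is the sketch below. The thresholds come from the quoted SKILL.md text; the `FieldStats` dataclass, the function names, and the report wording are hypothetical and not taken from the PR's scripts.

```python
from dataclasses import dataclass

FILL_RATE_TARGET = 0.80       # "fill rate >= 80%" from the quoted text
AVG_CONFIDENCE_TARGET = 0.85  # "avg confidence >= 0.85" from the quoted text


@dataclass
class FieldStats:
    """Hypothetical per-field summary produced by one test iteration."""
    fill_rate: float
    avg_confidence: float


def targets_met(stats: dict[str, FieldStats]) -> bool:
    """True only when there is at least one key field and all reach both targets."""
    return bool(stats) and all(
        s.fill_rate >= FILL_RATE_TARGET and s.avg_confidence >= AVG_CONFIDENCE_TARGET
        for s in stats.values()
    )


def final_report(analyzer_id: str, schema_path: str, samples_hint: str) -> str:
    """Assemble the three items the reviewer asked the skill to report."""
    return "\n".join(
        [
            f"Final analyzer ID: {analyzer_id}",
            f"Final schema file: {schema_path}",
            f"Next step: {samples_hint}",
        ]
    )
```

Passing the samples pointer in as a parameter keeps the sketch from guessing which sample files the skill should link to.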
Description
Adds two Copilot skills under `.github/skills/` in the `azure-ai-contentunderstanding` package that walk SDK users (and AI coding agents) through creating a custom analyzer end-to-end using the typed `ContentUnderstandingClient` already shipped in the package.

Zero public SDK API changes. Skills and scripts live under `.github/`, which is excluded from the PyPI sdist via the existing include-only `MANIFEST.in`, so `pip install azure-ai-contentunderstanding` is byte-identical for consumers.

- `cu-sdk-generate-analyzer`
- `cu-sdk-generate-analyzer-classify-route`

Each skill walks: env check → layout extraction → schema authoring (starting from a template) → local validation → analyzer create → batch test → category-aware stdout summary with leaf-level field rollout → optional ephemeral cleanup.
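
The "leaf-level" part of the summary step can be sketched as follows: nested extraction results are flattened to one row per leaf, with list items collapsed into a `[]` path segment (matching paths such as `lineItems[].description`). This is an illustrative sketch over plain dicts and lists; `leaf_paths` is a hypothetical helper, and the PR's scripts operate on the SDK's typed result objects, which may be traversed differently.

```python
def leaf_paths(value, prefix=""):
    """Yield (path, leaf_value) pairs from a nested dict/list structure.

    Dict keys are joined with "."; every list contributes a "[]" segment,
    so all items of a list share the same path.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            yield from leaf_paths(child, child_prefix)
    elif isinstance(value, list):
        for item in value:
            yield from leaf_paths(item, f"{prefix}[]")
    else:
        # Scalars (including None for an unfilled field) are leaves.
        yield prefix, value


# Usage: one row per leaf, ready to be grouped by path for a summary.
result = {
    "vendor": "Contoso",
    "lineItems": [
        {"description": "Widget", "amount": 10},
        {"description": None, "amount": 5},
    ],
}
rows = list(leaf_paths(result))
```

Grouping the resulting rows by path gives a natural place to compute per-field fill rate and average confidence.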
What's in this PR
- `.github/skills/_shared/schema_validator.py`: no `azure.*` / `requests` / `urllib` imports. Allow-list catches the `prebuilt-documentAnalyzer` typo class before any service call.
- `.github/skills/cu-sdk-generate-analyzer/`: `SKILL.md` + `scripts/{extract_layout,create_and_test}.py` + `.sh` wrappers + `templates/schema_template.json`
- `.github/skills/cu-sdk-generate-analyzer-classify-route/`: `SKILL.md` + `scripts/create_and_test_router.py` + `.sh` wrapper + `templates/classifier_template.json`
- `.github/skills/cu-sdk-common-knowledge/SKILL.md`
- `.github/skills/cu-sdk-sample-run/SKILL.md`
- `README.md`
- `.gitignore`: the `.local_only/` rule already covers `.local_only/layout/`, `.local_only/schemas/`, and `.local_only/test_results/` written by the skill scripts.
- `tests/test_skills_*.py`

End-to-end smoke runs
Both skills were executed against `samples/sample_files/mixed_financial_docs.pdf` (a packet containing invoice + bank statement + loan application). Ephemeral cleanup of all created analyzers confirmed for both runs.

Single analyzer (`cu-sdk-generate-analyzer`)

18 leaf rows reported, 100% fill rate across all extracted fields. Lowest confidence 0.661 on `lineItems[].description`, which is exactly the "agent reads the summary, proposes targeted v2 edits" loop the skill is designed for.

Classify-and-route (`cu-sdk-generate-analyzer-classify-route`)

3 categories correctly identified, 9 fields × 100% fill. Per-category denominator verified (each category's fill rate counts only segments classified into that category, not the packet-wide total).
All SDK Contribution checklist
- `.github/` is not in the sdist.
- Skills and scripts live under `.github/` and do not appear in any released wheel. Happy to add an "Other Changes" CHANGELOG entry if reviewers prefer.

General Guidelines and Best Practices

Testing Guidelines

- `tests/test_skills_*.py` (validator purity, classifier wiring, prebuilt-routing passthrough, single-analyzer argparse + invalid-schema-pre-import + leaf-row summary). Helper scripts are thin wrappers over the typed SDK; the underlying API calls are already covered by the SDK's recorded tests, so the manual smoke run captured above covers the end-to-end path.

Verification steps